Utilizing Embeddings for Ad-hoc Retrieval by Document-to-document Similarity
Abstract
Latent semantic representations of words or paragraphs, known as embeddings, have been widely applied to information retrieval (IR). A common way to use embeddings for IR is to estimate document-to-query (D2Q) similarity in the embedding space. Because words with similar syntactic usage tend to lie close together in the embedding space even when they are not semantically similar, the D2Q similarity approach may suffer from the problem of “multiple degrees of similarity”. To address this, this paper proposes a novel approach that estimates a semantic relevance score (SEM) based on document-to-document (D2D) similarity of embeddings. Since Word2Vec and Para2Vec generate embeddings from the context of words or paragraphs, the D2D similarity approach turns the task of document ranking into estimating the similarity between the content of different documents. Experimental results on standard TREC test collections show that the proposed approach outperforms strong baselines.
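The contrast between D2Q and D2D scoring described in the abstract can be illustrated with a minimal sketch. It is not the paper's SEM formula, which the abstract does not give. It assumes that document and query embeddings are already available as plain vectors, that a small set of reference documents (for example, top results of an initial retrieval run) stands in for the query, and that the D2D score is the mean cosine similarity to those references. The function names `d2q_score`, `d2d_score`, and `rank_by_d2d` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def d2q_score(doc_vec, query_vec):
    """D2Q: compare a document embedding directly with a query embedding."""
    return cosine(doc_vec, query_vec)


def d2d_score(doc_vec, reference_vecs):
    """D2D (assumed form): mean cosine similarity to reference document embeddings."""
    if not reference_vecs:
        return 0.0
    return sum(cosine(doc_vec, r) for r in reference_vecs) / len(reference_vecs)


def rank_by_d2d(doc_vecs, reference_vecs):
    """Return document ids sorted by descending D2D score."""
    scores = {doc_id: d2d_score(vec, reference_vecs) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional embeddings, chosen only for illustration.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
references = [[1.0, 0.0], [0.9, 0.1]]  # hypothetical pseudo-relevant documents
ranking = rank_by_d2d(docs, references)
print(ranking)
```

In this toy setup `d1` points in nearly the same direction as both reference vectors, so it ranks first, and `d2`, which is almost orthogonal to them, ranks last. Averaging over the references is only one possible aggregation; a maximum or a weighted sum would fit the same interface.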
Similar resources
Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval
A major difficulty in applying word vector embeddings in information retrieval is in devising an effective and efficient strategy for obtaining representations of compound units of text, such as whole documents, (in comparison to the atomic words), for the purpose of indexing and scoring documents. Instead of striving for a suitable method to obtain a single vector representation of a large doc...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. The proposed method presents a framework that uses relevance feedback. Relevance feedback, an interactive and efficient method, is used in this paper to imp...
UESTC at ImageCLEF 2012 Medical Tasks
This paper describes the methods used and results achieved by our research group in the ImageCLEF 2012 medical retrieval and classification tasks. We performed three sub-tasks: ad-hoc retrieval, case-based retrieval, and modality classification. For the retrieval tasks, we combined semantic-based retrieval with traditional text-based retrieval. The semantic-based retrieval was conducted by comp...
Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained global...
Journal title:
- CoRR
Volume: abs/1708.03181
Pages: -
Publication date: 2017